DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Abstract

Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a ‘space’ delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state of the art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.

Similar resources

Word Boundaries in French: Evidence from Large Speech Corpora

The goal of this paper is to investigate French word segmentation strategies using phonemic and lexical transcriptions as well as prosodic and part-of-speech annotations. Average fundamental frequency (f0) profiles and phoneme duration profiles are measured using 13 hours of broadcast news speech to study prosodic regularities of French words. Some influential factors are taken into considerati...

Morphological Lexicon Extraction from Raw Text Data

We introduce a tool extract developed for automatic extraction of lemma-paradigm pairs from raw text data. The tool combines regular expressions containing variables with propositional logic to form search patterns which identify lemmas tagged with their paradigm class. Furthermore, we describe the underlying algorithm of the tool and suggest a method for developing a morphological lexicon. The...

Parsing with subdomain instance weighting from raw corpora

The treebanks that are used for training statistical parsers consist of hand-parsed sentences from a single source/domain like newspaper text. However, newspaper text concerns different subdomains of language use (e.g. finance, sports, politics, music), which implies that the statistics gathered by generative statistical parsers are averages over subdomain statistics. In this paper we explore a...

Learning the lexicon from raw texts for open-vocabulary Korean word recognition

In this paper, we propose a novel method of building a language model for open-vocabulary Korean word recognition. Due to the complex morphology of Korean, it is inappropriate to use lexicons based on linguistic entities such as words and morphemes in open-vocabulary domains. Instead, we build the lexicon by collecting variable length character sequences from the raw texts using a dynamic Ba...

Automatic Generation of Compound Word Lexicon for Hindi Speech Synthesis

This paper addresses the problem of Hindi compound word splitting and its relevance to developing a good quality phonetizer for Hindi Speech Synthesis. The constituents of a Hindi compound word are not separated by a space or hyphen. Hence, most of the existing compound splitting algorithms cannot be applied to Hindi. We propose a new technique for automatic extraction of compound words from Hin...

Journal

Journal title: Transactions of the Association for Computational Linguistics

Year: 2022

ISSN: 2307-387X

DOI: https://doi.org/10.1162/tacl_a_00505